With the breakthrough of AlphaGo, deep reinforcement learning has become a recognized technique for solving sequential decision-making problems. Despite its reputation, the data inefficiency caused by its trial-and-error learning mechanism makes deep reinforcement learning hard to apply in a wide range of areas. Many methods have been developed for sample-efficient deep reinforcement learning, such as environment modeling, experience transfer, and distributed modifications, among which distributed deep reinforcement learning has shown its potential in various applications, such as human-computer gaming and intelligent transportation. In this paper, we survey the state of this exciting field by comparing the classical distributed deep reinforcement learning methods and studying the components important for efficient distributed learning, covering settings from single-player single-agent distributed deep reinforcement learning to the most complex multi-player multi-agent distributed deep reinforcement learning. Furthermore, we review recently released toolboxes that help to realize distributed deep reinforcement learning without many modifications to their non-distributed versions. By analyzing their strengths and weaknesses, we develop and release a multi-player multi-agent distributed deep reinforcement learning toolbox, which is further validated on Wargame, a complex environment, showing the usability of the proposed toolbox for multi-player multi-agent distributed deep reinforcement learning in complex games. Finally, we point out challenges and future trends, hoping this brief review can provide a guide or a spark for researchers who are interested in distributed deep reinforcement learning.
In this paper, we study the problem of visual grounding by considering both phrase extraction and grounding (PEG). In contrast to the previous phrase-known-at-test setting, PEG requires a model to extract phrases from text and locate objects in images simultaneously, which is a more practical setting in real applications. As phrase extraction can be regarded as a $1$D text segmentation problem, we formulate PEG as a dual detection problem and propose a novel DQ-DETR model, which introduces dual queries to probe different features from image and text for object prediction and phrase mask prediction. Each pair of dual queries is designed to have shared positional parts but different content parts. Such a design effectively alleviates the difficulty of modality alignment between image and text (in contrast to a single-query design) and empowers the Transformer decoder to leverage phrase mask-guided attention to improve performance. To evaluate the performance of PEG, we also propose a new metric, CMAP (cross-modal average precision), analogous to the AP metric in object detection. The new metric overcomes the ambiguity of Recall@1 in many-box-to-one-phrase cases in phrase grounding. As a result, our PEG pre-trained DQ-DETR establishes new state-of-the-art results on all visual grounding benchmarks with a ResNet-101 backbone. For example, it achieves $91.04\%$ and $83.51\%$ in terms of recall rate on RefCOCO testA and testB, respectively. Code will be available at \url{https://github.com/IDEA-Research/DQ-DETR}.
Perception algorithms in autonomous driving systems face great challenges in long-tail traffic scenarios, where problems of Safety of the Intended Functionality (SOTIF) can be triggered by insufficient algorithm performance and a dynamic operational environment. However, such scenarios are not systematically included in current open-source datasets, and this paper fills that gap. Based on the analysis and enumeration of trigger conditions, a high-quality, diverse dataset is released, covering various long-tail traffic scenarios collected from multiple sources. Considering the development of probabilistic object detection (POD), this dataset marks the trigger sources that may cause perception SOTIF problems in the scenarios as key objects. In addition, an evaluation protocol is suggested to verify the effectiveness of POD algorithms in identifying the key objects via uncertainty. The dataset is continuously expanding, and the first batch of open-source data includes 1126 frames with an average of 2.27 key objects and 2.47 normal objects in each frame. To demonstrate how to use this dataset for SOTIF research, this paper further quantifies the perception SOTIF entropy to confirm whether a scenario is unknown and unsafe for a perception system. The experimental results show that the quantified entropy can effectively and efficiently reflect the failure of the perception algorithm.
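Vision Transformers (ViTs) have shown promising performance compared with convolutional neural networks (CNNs), but training ViTs is much harder than training CNNs. In this paper, we define several metrics, including Dynamic Data Proportion (DDP) and Knowledge Assimilation Rate (KAR), to investigate the training process, and divide it into three periods: formation, growth, and exploration. In particular, at the last stage of training, we observe that only a tiny portion of the training examples is used to optimize the model. Given the data-hungry nature of ViTs, we ask a simple but important question: is it possible to provide abundant ``effective'' training examples at every stage of training? To address this issue, we need to answer two critical questions, i.e., how to measure the ``effectiveness'' of individual training examples, and how to systematically generate a sufficient number of ``effective'' examples. To answer the first question, we find that the ``difficulty'' of training samples can serve as an indicator of their ``effectiveness''. To address the second question, we propose to dynamically adjust the ``difficulty'' distribution of the training data across these evolution stages. To achieve these two purposes, we propose a novel data-centric ViT training framework that dynamically measures the ``difficulty'' of training samples and generates ``effective'' samples for the model at different training stages. Furthermore, to further enlarge the number of ``effective'' samples and alleviate the overfitting problem in the late training stage of ViTs, we propose a patch-level erasing strategy called PatchErasing. Extensive experiments demonstrate the effectiveness of the proposed data-centric ViT training framework and techniques.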
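Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web. Generally, a video contains rich instance and event information, while a query text describes only part of that information. Hence, a video can correspond to multiple different text descriptions and queries. We call this phenomenon the ``Video-Text Correspondence Ambiguity'' problem. Current techniques mostly concentrate on mining local or multi-level alignment between the contents of a video and a text (\textit{e.g.}, object to entity and action to verb). It is difficult for these methods to alleviate the video-text correspondence ambiguity, because they describe a video with only one single feature, which is required to match multiple different text features at the same time. To address this problem, we propose a Text-Adaptive Multiple Visual Prototype Matching model, which automatically captures multiple prototypes to describe a video by adaptively aggregating video token features. Given a query text, the similarity is determined by the most similar prototype in order to find the correspondence in the video, which is termed text-adaptive matching. To learn diverse prototypes that represent the rich information in videos, we propose a variance loss to encourage different prototypes to attend to different contents of the video. Our method outperforms state-of-the-art methods on four public video retrieval datasets.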
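The text-adaptive matching step described above (score a video by its single most similar prototype) can be illustrated with a minimal sketch. This is not the authors' implementation: the use of cosine similarity, the function names `cosine` and `text_adaptive_similarity`, and the plain-list feature vectors are all assumptions made for illustration, and the way prototypes are aggregated from video tokens is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_u * norm_v)


def text_adaptive_similarity(text_feature, prototypes):
    """Return (best_score, best_index) over the video's prototypes.

    The video-text similarity is taken to be the score of the prototype
    most similar to the query text, so different texts may select
    different prototypes of the same video.
    """
    scores = [cosine(text_feature, p) for p in prototypes]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return scores[best_index], best_index


# Hypothetical 2-D features: one video described by two prototypes.
video_prototypes = [[1.0, 0.0], [0.0, 1.0]]
score, index = text_adaptive_similarity([0.0, 2.0], video_prototypes)
```

Taking the maximum rather than an average means a query only needs to agree with one aspect of the video, which is how a single video can match several different descriptions.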
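We study the task of learning robust feature representations, aiming to generalize well on multiple datasets for action recognition. We build our method on Transformers for their efficacy. Although we have witnessed great progress in video action recognition over the past decade, it remains challenging yet valuable to train a single model that performs well across multiple datasets. Here, we propose a novel multi-dataset training paradigm, MultiTrain, with the design of two new loss terms, namely an informative loss and a projection loss, aiming to learn robust representations for action recognition. In particular, the informative loss maximizes the expressiveness of the feature embedding, while the projection loss for each dataset mines the intrinsic relations between classes across datasets. We verify the effectiveness of our method on five challenging datasets: Kinetics-400, Kinetics-700, Moments-in-Time, ActivityNet, and Something-Something-v2. Extensive experimental results show that our method consistently improves the state-of-the-art performance.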
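The assessment of Alzheimer's disease (AD) and mild cognitive impairment (MCI) associated with brain changes remains a challenging task. Recent studies have demonstrated that combining multi-modality imaging techniques can better reflect pathological characteristics and contribute to more accurate diagnosis of AD and MCI. In this paper, we propose a novel tensor-based multi-modality feature selection and regression method for diagnosis and biomarker identification of AD and MCI against normal controls. Specifically, we leverage the tensor structure to exploit the high-level correlation information inherent in multi-modality data, and investigate tensor-level sparsity in the multilinear regression model. We present the practical advantages of our method for the analysis of ADNI data using three imaging modalities (VBM-MRI, FDG-PET, and AV45-PET) with clinical parameters of disease severity and cognitive scores. The experimental results demonstrate the superior performance of our proposed method against state-of-the-art methods for disease diagnosis and for the identification of disease-specific regions and modality-related differences. The code for this work is publicly available at https://github.com/junfish/bios22.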
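Anticipating the lane change intentions of surrounding vehicles is crucial for efficient and safe driving decision making in an autonomous driving system. Previous works often adopt physical variables, such as driving speed and acceleration, for lane change classification. However, physical variables do not contain semantic information. Although 3D CNNs have been developing rapidly, few methods use action recognition models and appearance features for lane change recognition, and they all require additional information to pre-process the data. In this work, we propose an end-to-end framework that includes two action recognition methods for lane change recognition, using video data collected by cameras. Our method achieves the best lane change classification results using only the RGB video data of the PREVENTION dataset. Class activation maps demonstrate that action recognition models can efficiently extract lane change motions. A method to better extract motion clues is also proposed in this paper.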
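The gonadotropin-releasing hormone receptor (GnRH1R) is a promising therapeutic target for the treatment of uterine diseases. To date, several GnRH1R antagonists are available in clinical investigation, but none of them satisfies multiple property constraints. To fill this gap, we aim to develop a learning-based framework to facilitate the effective and efficient discovery of new orally active small-molecule drugs targeting GnRH1R with desirable properties. In the present work, a ligand-and-structure combined model, namely LS-MolGen, was first proposed for molecular generation by fully utilizing the information on known active compounds and the structure of the target protein; its effectiveness was demonstrated by its superior performance over ligand-based or structure-based methods used separately. Then, an in silico screening, including activity prediction, ADMET evaluation, molecular docking, and FEP calculation, was conducted, in which about 30,000 generated novel molecules were narrowed down to 8 for experimental synthesis and validation. In vitro and in vivo experiments showed that three of them exhibited potent inhibitory activities against GnRH1R (compound 5 IC50 = 0.856 nM, compound 6 IC50 = 0.901 nM, compound 7 IC50 = 2.54 nM), and that compound 5 performed well in fundamental PK properties, such as half-life, oral bioavailability, and PPB. We believe that the proposed ligand-and-structure combined molecular generative model and the whole computer-aided workflow can potentially be extended to similar tasks for de novo design or lead optimization.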
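Joint super-resolution and inverse tone-mapping (joint SR-ITM) aims to increase the resolution and dynamic range of low-resolution, standard-dynamic-range images. Existing methods mainly resort to image decomposition techniques with multi-branch network architectures. However, the rigid decomposition employed by these methods largely restricts their power on versatile images. To exploit its potential power, in this paper we generalize the decomposition mechanism from the image domain to the broader feature domain. To this end, we propose a lightweight Feature Decomposition Aggregation Network (FDAN). In particular, we design a Feature Decomposition Block (FDB) that achieves learnable separation of feature details and contrasts. By cascading FDBs, we build a hierarchical feature decomposition group for powerful multi-level feature decomposition. In addition, we present a new benchmark dataset for joint SR-ITM, i.e., SRITM-4K, which is large-scale and provides versatile scenarios for sufficient model training and evaluation. Experimental results on two benchmark datasets demonstrate that our FDAN is effective and outperforms previous methods on joint SR-ITM. Our code and dataset will be publicly released.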